We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence 
(AGI) models and their precursors. This framework introduces levels of AGI performance, generality, 
and autonomy. It is our hope that this framework will be useful in an analogous way to the levels of 
autonomous driving, by providing a common language to compare models, assess risks, and measure 
progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and 
distill six principles that a useful ontology for AGI should satisfy. These principles include focusing on 
capabilities rather than mechanisms; separately evaluating generality and performance; and defining 
stages along the path toward AGI, rather than focusing on the endpoint. With these principles in mind, 
we propose “Levels of AGI” based on depth (performance) and breadth (generality) of capabilities, and 
reflect on how current systems fit into this ontology. We discuss the challenging requirements for future 
benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we 
discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and 
emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and 
safe deployment of highly capable AI systems. 


Keywords: AI, AGI, Artificial General Intelligence, General AI, Human-Level AI, HLAI, ASI, frontier models, 
benchmarking, metrics, AI safety, AI risk, autonomous systems, Human-AI Interaction 


Introduction 


Artificial General Intelligence (AGI)¹ is an important and sometimes controversial concept in computing 
research, used to describe an AI system that is at least as capable as a human at most tasks. Given the 
rapid advancement of Machine Learning (ML) models, the concept of AGI has passed from being the 
subject of philosophical debate to one with near-term practical relevance. Some experts believe that 
“sparks” of AGI (Bubeck et al., 2023) are already present in the latest generation of large language 
models (LLMs); some predict AI will broadly outperform humans within about a decade (Bengio et al., 
2023); some even assert that current LLMs are AGIs (Agüera y Arcas and Norvig, 2023). However, if 
you were to ask 100 AI experts to define what they mean by “AGI,” you would likely get 100 related 
but different definitions. 


The concept of AGI is important as it maps onto goals for, predictions about, and risks of AI: 


Goals: Achieving human-level “intelligence” is an implicit or explicit north-star goal for many 
in our field, from the 1955 Dartmouth AI Conference (McCarthy et al., 1955) that kick-started the 
modern field of AI to some of today’s leading AI research firms whose mission statements allude to 
concepts such as “ensure transformative AI helps people and society” (Anthropic, 2023a) or “ensure 
that artificial general intelligence benefits all of humanity” (OpenAI, 2023). 


1 There is controversy over use of the term “AGI.” Some communities favor “General AI” or “Human-Level AI” 
(Gruetzemacher and Paradice, 2019) as alternatives, or even simply “AI” as a term that now effectively encompasses AGI (or soon 
will, under optimistic predictions). However, AGI is a term of art used by both technologists and the general public, and is 
thus useful for clear communication. Similarly, for clarity we use commonly understood terms such as “Artificial Intelligence” 
and “Machine Learning,” although we are sympathetic to critiques (Bigham, 2019) that these terms anthropomorphize 
computing systems. 


Predictions: The concept of AGI is related to a prediction about progress in AI, namely that 
it is toward greater generality, approaching and exceeding human generality. Additionally, AGI is 
typically intertwined with a notion of “emergent” properties (Wei et al., 2022), i.e. capabilities not 
explicitly anticipated by the developer. Such capabilities offer promise, perhaps including abilities 
that are complementary to typical human skills, enabling new types of interaction or novel industries. 
Such predictions about AGI’s capabilities in turn predict likely societal impacts; AGI may have 
significant economic implications, i.e., reaching the necessary criteria for widespread labor substitution 
(Dell’Acqua et al., 2023; Ellingrud et al., 2023), as well as geo-political implications relating not 
only to the economic advantages AGI may confer, but also to military considerations (Kissinger et al., 
2022). 


Risks: Lastly, AGI is viewed by some as a concept for identifying the point when there are extreme 
risks (Bengio et al., 2023; Shevlane et al., 2023), as some speculate that AGI systems might be able 
to deceive and manipulate, accumulate resources, advance goals, behave agentically, outwit humans 
in broad domains, displace humans from key roles, and/or recursively self-improve. 


In this paper, we argue that it is critical for the AI research community to explicitly reflect on what 
we mean by "AGI," and aspire to quantify attributes like the performance, generality, and autonomy 
of AI systems. Shared operationalizable definitions for these concepts will support: comparisons 
between models; risk assessments and mitigation strategies; clear criteria from policymakers and 
regulators; identifying goals, predictions, and risks for research and development; and the ability to 
understand and communicate where we are along the path to AGI. 


Defining AGI: Case Studies 


Many AI researchers and organizations have proposed definitions of AGI. In this section, we consider 
nine prominent examples, and reflect on their strengths and limitations. This analysis informs our 
subsequent introduction of a two-dimensional, leveled ontology of AGI. 


Case Study 1: The Turing Test. The Turing Test (Turing, 1950) is perhaps the most well-known 
attempt to operationalize an AGI-like concept. Turing’s “imitation game” was posited as a way to 
operationalize the question of whether machines could think, and asks a human to interactively 
distinguish whether text is produced by another human or by a machine. The test as originally framed 
is a thought experiment, and is the subject of many critiques (Wikipedia, 2023b); in practice, the 
test often highlights the ease of fooling people (Weizenbaum, 1966; Wikipedia, 2023a) rather than 
the “intelligence” of the machine. Given that modern LLMs pass some framings of the Turing Test, 
it seems clear that this criterion is insufficient for operationalizing or benchmarking AGI. We agree 
with Turing that whether a machine can “think,” while an interesting philosophical and scientific 
question, seems orthogonal to the question of what the machine can do; the latter is much more 
straightforward to measure and more important for evaluating impacts. Therefore we propose that 
AGI should be defined in terms of capabilities rather than processes². 


2 As research into mechanistic interpretability (Räuker et al., 2023) advances, it may enable process-oriented metrics. 
These may be relevant to future definitions of AGI. 


Case Study 2: Strong AI - Systems Possessing Consciousness. Philosopher John Searle 
mused, "according to strong AI, the computer is not merely a tool in the study of the mind; rather, 
the appropriately programmed computer really is a mind, in the sense that computers given the 
right programs can be literally said to understand and have other cognitive states" (Searle, 1980). 
While strong AI might be one path to achieving AGI, there is no scientific consensus on methods 
for determining whether machines possess strong AI attributes such as consciousness (Butlin et al., 
2023), making the process-oriented focus of this framing impractical. 


Case Study 3: Analogies to the Human Brain. The original use of the term "artificial general 
intelligence" was in a 1997 article about military technologies by Mark Gubrud (Gubrud, 1997), 
which defined AGI as “AI systems that rival or surpass the human brain in complexity and speed, that 
can acquire, manipulate and reason with general knowledge, and that are usable in essentially any 
phase of industrial or military operations where a human intelligence would otherwise be needed.” 
This early definition emphasizes processes (rivaling the human brain in complexity) in addition to 
capabilities; while neural network architectures underlying modern ML systems are loosely inspired 
by the human brain, the success of transformer-based architectures (Vaswani et al., 2023) whose 
performance is not reliant on human-like learning suggests that strict brain-based processes and 
benchmarks are not inherently necessary for AGI. 


Case Study 4: Human-Level Performance on Cognitive Tasks. Legg (Legg, 2008) and Goertzel 
(Goertzel, 2014) popularized the term AGI among computer scientists in 2001 (Legg, 2022), describing 
AGI as a machine that is able to do the cognitive tasks that people can typically do. This definition 
notably focuses on non-physical tasks (i.e., not requiring robotic embodiment as a precursor to AGI). 
Like many other definitions of AGI, this framing presents ambiguity around choices such as “what 
tasks?” and “which people?”. 


Case Study 5: Ability to Learn Tasks. In The Technological Singularity (Shanahan, 2015), 
Shanahan suggests that AGI is “Artificial intelligence that is not specialized to carry out specific tasks, 
but can learn to perform as broad a range of tasks as a human.” An important property of this framing 
is its emphasis on the value of including metacognitive tasks (learning) among the requirements for 
achieving AGI. 


Case Study 6: Economically Valuable Work. OpenAI’s charter defines AGI as “highly autonomous 
systems that outperform humans at most economically valuable work” (OpenAI, 2018). This definition 
has strengths per the “capabilities, not processes” criterion, as it focuses on performance agnostic to 
underlying mechanisms; further, this definition offers a potential yardstick for measurement, i.e., 
economic value. A shortcoming of this definition is that it does not capture all of the criteria that 
may be part of “general intelligence.” There are many tasks that are associated with intelligence 
that may not have a well-defined economic value (e.g., artistic creativity or emotional intelligence). 
Such properties may be indirectly accounted for in economic measures (e.g., artistic creativity might 
produce books or movies, emotional intelligence might relate to the ability to be a successful CEO), 
though whether economic value captures the full spectrum of “intelligence” remains unclear. Another 
challenge with a framing of AGI in terms of economic value is that this implies a need for deployment 
of AGI in order to realize that value, whereas a focus on capabilities might only require the potential 
for an AGI to execute a task. We may well have systems that are technically capable of performing 
economically important tasks but don’t realize that economic value for varied reasons (legal, ethical, 
social, etc.). 


Case Study 7: Flexible and General - The "Coffee Test" and Related Challenges. Marcus 
suggests that AGI is “shorthand for any intelligence (there might be many) that is flexible and general, 
with resourcefulness and reliability comparable to (or beyond) human intelligence” (Marcus, 2022b). 
This definition captures both generality and performance (via the inclusion of reliability); the mention 
of “flexibility” is noteworthy, since, like the Shanahan formulation, this suggests that metacognitive 
tasks such as the ability to learn new skills must be included in an AGI’s set of capabilities in order to 
achieve sufficient generality. Further, Marcus operationalizes his definition by proposing five concrete 
tasks (understanding a movie, understanding a novel, cooking in an arbitrary kitchen, writing a 
bug-free 10,000 line program, and converting natural language mathematical proofs into symbolic 
form) (Marcus, 2022a). Accompanying a definition with a benchmark is valuable; however, more 
work would be required to construct a sufficiently comprehensive benchmark. While we agree that 
failing some of these tasks indicates a system is not an AGI, it is unclear that passing them is sufficient 
for AGI status. In the Testing for AGI section, we further discuss the challenge in developing a set 
of tasks that is both necessary and sufficient for capturing the generality of AGI. We also note that 
one of Marcus’ proposed tasks, “work as a competent cook in an arbitrary kitchen” (a variant of 
Steve Wozniak’s “Coffee Test” (Wozniak, 2010)), requires robotic embodiment; this differs from other 
definitions that focus on non-physical tasks³. 


Case Study 8: Artificial Capable Intelligence. In The Coming Wave, Suleyman proposed the 
concept of "Artificial Capable Intelligence (ACI)" (Mustafa Suleyman and Michael Bhaskar, 2023) to 
refer to AI systems with sufficient performance and generality to accomplish complex, multi-step tasks 
in the open world. More specifically, Suleyman proposed an economically-based definition of ACI skill 
that he dubbed the “Modern Turing Test,” in which an AI would be given $100,000 of capital and 
tasked with turning that into $1,000,000 over a period of several months. This framing is more narrow 
than OpenAI’s definition of economically valuable work and has the additional downside of potentially 
introducing alignment risks (Kenton et al., 2021) by only targeting fiscal profit. However, a strength 
of Suleyman’s concept is the focus on performing a complex, multi-step task that humans value. 
Construed more broadly than making a million dollars, ACI’s emphasis on complex, real-world tasks 
is noteworthy, since such tasks may have more ecological validity than many current AI benchmarks; 
Marcus’ aforementioned five tests of flexibility and generality (Marcus, 2022a) seem within the spirit 
of ACI, as well. 


Case Study 9: SOTA LLMs as Generalists. Agüera y Arcas and Norvig (Agüera y Arcas and 
Norvig, 2023) suggested that state-of-the-art LLMs (e.g. mid-2023 deployments of GPT-4, Bard, Llama 
2, and Claude) already are AGIs, arguing that generality is the key property of AGI, and that because 
language models can discuss a wide range of topics, execute a wide range of tasks, handle multimodal 
inputs and outputs, operate in multiple languages, and “learn” from zero-shot or few-shot examples, 
they have achieved sufficient generality. While we agree that generality is a crucial characteristic of 
AGI, we posit that it must also be paired with a measure of performance (i.e., if an LLM can write code 
or perform math, but is not reliably correct, then its generality is not yet sufficiently performant). 


Defining AGI: Six Principles 


Reflecting on these nine example formulations of AGI (or AGI-adjacent concepts), we identify properties 
and commonalities that we feel contribute to a clear, operationalizable definition of AGI. We argue 
that any definition of AGI should meet the following six criteria: 


1. Focus on Capabilities, not Processes. The majority of definitions focus on what an AGI can 
accomplish, not on the mechanism by which it accomplishes tasks. This is important for identifying 
characteristics that are not necessarily a prerequisite for achieving AGI (but may nonetheless be 
interesting research topics). This focus on capabilities allows us to exclude the following from our 
requirements for AGI: 


• Achieving AGI does not imply that systems think or understand in a human-like way (since this 
focuses on processes, not capabilities) 

• Achieving AGI does not imply that systems possess qualities such as consciousness (subjective 
awareness) (Butlin et al., 2023) or sentience (the ability to have feelings) (since these qualities 
not only have a process focus, but are not currently measurable by agreed-upon scientific 
methods) 


3 Though robotics might also be implied by the OpenAI charter’s focus on “economically valuable work,” the fact that 
OpenAI shut down its robotics research division in 2021 (Wiggers, 2021) suggests this is not their intended interpretation. 


2. Focus on Generality and Performance. All of the above definitions emphasize generality 
to varying degrees, but some exclude performance criteria. We argue that both generality and 
performance are key components of AGI. In the next section we introduce a leveled taxonomy that 
considers the interplay between these dimensions. 


3. Focus on Cognitive and Metacognitive Tasks. Whether to require robotic embodiment (Roy 
et al., 2021) as a criterion for AGI is a matter of some debate. Most definitions focus on cognitive 
tasks, by which we mean non-physical tasks. Despite recent advances in robotics (Brohan et al., 
2023), physical capabilities for AI systems seem to be lagging behind non-physical capabilities. It is 
possible that embodiment in the physical world is necessary for building the world knowledge to be 
successful on some cognitive tasks (Shanahan, 2010), or at least may be one path to success on some 
classes of cognitive tasks; if that turns out to be true then embodiment may be critical to some paths 
toward AGI. We suggest that the ability to perform physical tasks increases a system’s generality, but 
should not be considered a necessary prerequisite to achieving AGI. On the other hand, metacognitive 
capabilities (such as the ability to learn new tasks or the ability to know when to ask for clarification 
or assistance from a human) are key prerequisites for systems to achieve generality. 


4. Focus on Potential, not Deployment. Demonstrating that a system can perform a requisite 
set of tasks at a given level of performance should be sufficient for declaring the system to be an 
AGI; deployment of such a system in the open world should not be inherent in the definition of AGI. 
For instance, defining AGI in terms of reaching a certain level of labor substitution would require 
real-world deployment, whereas defining AGI in terms of being capable of substituting for labor would 
focus on potential. Requiring deployment as a condition of measuring AGI introduces non-technical 
hurdles such as legal and social considerations, as well as potential ethical and safety concerns. 


5. Focus on Ecological Validity. Tasks that can be used to benchmark progress toward AGI are 
critical to operationalizing any proposed definition. While we discuss this further in the “Testing for 
AGI” section, we emphasize here the importance of choosing tasks that align with real-world (i.e., 
ecologically valid) tasks that people value (construing “value” broadly, not only as economic value but 
also social value, artistic value, etc.). This may mean eschewing traditional AI metrics that are easy to 
automate or quantify (Raji et al., 2021) but may not capture the skills that people would value in an 
AGI. 


6. Focus on the Path to AGI, not a Single Endpoint. Much as the adoption of a standard set of 
Levels of Driving Automation (SAE International, 2021) allowed for clear discussions of policy and 
progress relating to autonomous vehicles, we posit there is value in defining “Levels of AGI.” As we 
discuss in subsequent sections, we intend for each level of AGI to be associated with a clear set of 
metrics/benchmarks, as well as identified risks introduced at each level, and resultant changes to 
the Human-AI Interaction paradigm (Morris et al., 2023). This level-based approach to defining AGI 
supports the coexistence of many prominent formulations; for example, Agüera y Arcas and Norvig’s 
definition (Agüera y Arcas and Norvig, 2023) would fall into the “Emerging AGI” category of our 
ontology, while OpenAI’s threshold of labor replacement (OpenAI, 2018) better matches “Virtuoso 
AGI.” Our “Competent AGI” level is probably the best catch-all for many existing definitions of AGI 
(e.g., the Legg (Legg, 2008), Shanahan (Shanahan, 2015), and Suleyman (Mustafa Suleyman and 
Michael Bhaskar, 2023) formulations). In the next section, we introduce a level-based ontology of 
AGI. 
Levels of AGI 


| Performance (rows) x Generality (columns) | Narrow: clearly scoped task or set of tasks | General: wide range of non-physical tasks, including metacognitive abilities like learning new skills |
| --- | --- | --- |
| Level 0: No AI | Narrow Non-AI: calculator software; compiler | General Non-AI: human-in-the-loop computing, e.g., Amazon Mechanical Turk |
| Level 1: Emerging (equal to or somewhat better than an unskilled human) | Emerging Narrow AI: GOFAI⁴; simple rule-based systems, e.g., SHRDLU (Winograd, 1971) | Emerging AGI: ChatGPT (OpenAI, 2023), Bard (Anil et al., 2023), Llama 2 (Touvron et al., 2023) |
| Level 2: Competent (at least 50th percentile of skilled adults) | Competent Narrow AI: toxicity detectors such as Jigsaw (Das et al., 2022); Smart Speakers such as Siri (Apple), Alexa (Amazon), or Google Assistant (Google); VQA systems such as PaLI (Chen et al., 2023); Watson (IBM); SOTA LLMs for a subset of tasks (e.g., short essay writing, simple coding) | Competent AGI: not yet achieved |
| Level 3: Expert (at least 90th percentile of skilled adults) | Expert Narrow AI: spelling & grammar checkers such as Grammarly (Grammarly, 2023); generative image models such as Imagen (Saharia et al., 2022) or DALL-E 2 (Ramesh et al., 2022) | Expert AGI: not yet achieved |
| Level 4: Virtuoso (at least 99th percentile of skilled adults) | Virtuoso Narrow AI: Deep Blue (Campbell et al., 2002), AlphaGo (Silver et al., 2016, 2017) | Virtuoso AGI: not yet achieved |
| Level 5: Superhuman (outperforms 100% of humans) | Superhuman Narrow AI: AlphaFold (Jumper et al., 2021; Varadi et al., 2021), AlphaZero (Silver et al., 2018), StockFish (Stockfish, 2023) | Artificial Superintelligence (ASI): not yet achieved |

Table 1 | A leveled, matrixed approach toward classifying systems on the path to AGI based on 
depth (performance) and breadth (generality) of capabilities. Example systems in each cell are 
approximations based on current descriptions in the literature or experiences interacting with deployed 
systems. Unambiguous classification of AI systems will require a standardized benchmark of tasks, as 
we discuss in the Testing for AGI section. 


In accordance with Principle 2 ("Focus on Generality and Performance") and Principle 6 ("Focus 
on the Path to AGI, not a Single Endpoint"), in Table 1 we introduce a matrixed leveling system that 
focuses on performance and generality as the two dimensions that are core to AGI: 
• Performance refers to the depth of an AI system’s capabilities, i.e., how it compares to human-level 
performance for a given task. Note that for all performance levels above “Emerging,” 
percentiles are in reference to a sample of adults who possess the relevant skill (e.g., “Competent” 
or higher performance on a task such as English writing ability would only be measured against 
the set of adults who are literate and fluent in English). 

• Generality refers to the breadth of an AI system’s capabilities, i.e., the range of tasks for which 
an AI system reaches a target performance threshold. 
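
As a minimal sketch of the performance dimension, the thresholds in Table 1 can be written as a lookup from a per-task percentile to a level. The names `Level` and `performance_level`, the `beats_all_humans` flag, and the choice to treat every score below the 50th percentile as “Emerging” are illustrative assumptions rather than part of the framework; in particular, “Emerging” is defined relative to an unskilled human, not by a percentile.

```python
from enum import IntEnum


class Level(IntEnum):
    """Performance levels 1-5 from Table 1 (Level 0, "No AI", is omitted)."""
    EMERGING = 1
    COMPETENT = 2
    EXPERT = 3
    VIRTUOSO = 4
    SUPERHUMAN = 5


def performance_level(percentile: float, beats_all_humans: bool = False) -> Level:
    """Map one task's score to a performance level.

    percentile: share (0-100) of adults possessing the relevant skill whom
        the system matches or outperforms on this task.
    beats_all_humans: True only if the system outperforms 100% of humans.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if beats_all_humans:
        return Level.SUPERHUMAN
    if percentile >= 99:
        return Level.VIRTUOSO
    if percentile >= 90:
        return Level.EXPERT
    if percentile >= 50:
        return Level.COMPETENT
    # Assumption: the system at least matches an unskilled human on the task.
    return Level.EMERGING


print(performance_level(62).name)  # COMPETENT
```

Generality would then be assessed by applying such a mapping across a broad set of tasks rather than to a single one.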


This taxonomy specifies the minimum performance over most tasks needed to achieve a given 
rating — e.g., a Competent AGI must have performance at least at the 50th percentile for skilled adult 
humans on most cognitive tasks, but may have Expert, Virtuoso, or even Superhuman performance 
on a subset of tasks. As an example of how individual systems may straddle different points in our 
taxonomy, we posit that as of this writing in September 2023, frontier language models (e.g., ChatGPT 
(OpenAI, 2023), Bard (Anil et al., 2023), Llama 2 (Touvron et al., 2023), etc.) exhibit “Competent” 
performance levels for some tasks (e.g., short essay writing, simple coding), but are still at “Emerging” 
performance levels for most tasks (e.g., mathematical abilities, tasks involving factuality). Overall, 
current frontier language models would therefore be considered a Level 1 General AI (“Emerging 
AGI”) until the performance level increases for a broader set of tasks (at which point the Level 2 
General AI, “Competent AGI,” criteria would be met). We suggest that documentation for frontier 
AI models, such as model cards (Mitchell et al., 2019), should detail this mixture of performance 
levels. This will help end-users, policymakers, and other stakeholders come to a shared, nuanced 
understanding of the likely uneven performance of systems progressing along the path to AGI. 
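
The rating rule in this paragraph can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the reading of “most tasks” as a strict majority (the `most` parameter), and the example task list with its per-task levels are hypothetical and are not specified by the framework.

```python
from collections import Counter

LEVEL_NAMES = {0: "No AI", 1: "Emerging", 2: "Competent",
               3: "Expert", 4: "Virtuoso", 5: "Superhuman"}


def overall_level(task_levels: dict[str, int], most: float = 0.5) -> int:
    """Return the highest level reached on more than `most` of the tasks."""
    if not task_levels:
        raise ValueError("at least one task is required")
    total = len(task_levels)
    for level in range(5, 0, -1):
        share = sum(1 for v in task_levels.values() if v >= level) / total
        if share > most:
            return level
    return 0


def level_mixture(task_levels: dict[str, int]) -> dict[str, int]:
    """Count tasks per level name, e.g. for reporting in a model card."""
    return dict(Counter(LEVEL_NAMES[v] for v in task_levels.values()))


# Hypothetical per-task levels for a frontier language model.
frontier_llm = {
    "short essay writing": 2,
    "simple coding": 2,
    "mathematics": 1,
    "factuality": 1,
    "planning": 1,
}
print(LEVEL_NAMES[overall_level(frontier_llm)])  # Emerging
print(level_mixture(frontier_llm))  # {'Competent': 2, 'Emerging': 3}
```

Reporting the mixture alongside the single overall rating keeps the uneven profile visible: the system above is “Competent” on two tasks even though its overall rating is “Emerging.”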


The order in which stronger skills in specific cognitive areas are acquired may have serious 
implications for AI safety (e.g., acquiring strong knowledge of chemical engineering before acquiring 
strong ethical reasoning skills may be a dangerous combination). Note also that the rate of progression 
between levels of performance and/or generality may be nonlinear. Acquiring the capability to learn 
new skills may particularly accelerate progress toward the next level. 


While this taxonomy rates systems according to their performance, systems that are capable of 
achieving a certain level of performance (e.g., against a given benchmark) may not match this level 
in practice when deployed. For instance, user interface limitations may reduce deployed performance. 
Consider the example of DALL-E 2 (Ramesh et al., 2022), which we estimate as a Level 3 Narrow AI 
(“Expert Narrow AI”) in our taxonomy. We estimate the “Expert” level of performance since DALL-E 2 
produces images of higher quality than most people are able to draw; however, the system has 
failure modes (e.g., drawing hands with incorrect numbers of digits, rendering nonsensical or illegible 
text) that prevent it from achieving a “Virtuoso” performance designation. While theoretically an 
“Expert” level system, in practice the system may only be “Competent,” because prompting interfaces 
are too complex for most end-users to elicit optimal performance (as evidenced by the existence of 
marketplaces (e.g., (PromptBase)) in which skilled prompt engineers sell prompts). This observation 
emphasizes the importance of designing ecologically valid benchmarks (that would measure deployed 
rather than idealized performance) as well as the importance of considering how human-AI interaction 
paradigms interact with the notion of AGI (a topic we return to in the “Capabilities vs. Autonomy” 
Section). 


The highest level in our matrix in terms of combined performance and generality is ASI (Artificial 
Superintelligence). We define "Superhuman" performance as outperforming 100% of humans. For 
instance, we posit that AlphaFold (Jumper et al., 2021; Varadi et al., 2021) is a Level 5 Narrow 
AI ("Superhuman Narrow AI") since it performs a single task (predicting a protein’s 3D structure 
from an amino acid sequence) above the level of the world’s top scientists. This definition means 
that Level 5 General AI ("ASI") systems will be able to do a wide range of tasks at a level that no 
human can match. This framing also implies that Superhuman systems may be able to 
perform an even broader range of tasks than lower levels of AGI, since the ability to execute tasks 
that qualitatively differ from existing human skills would by definition outperform all humans (who 
fundamentally cannot do such tasks). For example, non-human skills that an ASI might have could 
include capabilities such as neural interfaces (perhaps through mechanisms such as analyzing brain 
signals to decode thoughts (Tang et al., 2023)), oracular abilities (perhaps through mechanisms such 
as analyzing large volumes of data to make high-quality predictions), or the ability to communicate 
with animals (perhaps by mechanisms such as analyzing patterns in their vocalizations, brain waves, 
or body language). 


Testing for AGI 


Two of our six proposed principles for defining AGI (Principle 2: Generality and Performance; Principle 
6: Focus on the Path to AGI) influenced our choice of a matrixed, leveled ontology for facilitating 
nuanced discussions of the breadth and depth of AI capabilities. Our remaining four principles 
(Principle 1: Capabilities, not Processes; Principle 3: Cognitive and Metacognitive Tasks; Principle 4: 
Potential, not Deployment; and Principle 5: Ecological Validity) relate to the issue of measurement. 


While our performance dimension specifies one aspect of measurement (e.g., percentile ranges 
for task performance relative to particular subsets of people), our generality dimension leaves open 
important questions: What is the set of tasks that constitute the generality criteria? What proportion 
of such tasks must an AI system master to achieve a given level of generality in our schema? Are there 
some tasks that must always be performed to meet the criteria for certain generality levels, such as 
metacognitive tasks? 


Operationalizing an AGI definition requires answering these questions, as well as developing 
specific diverse and challenging tasks. Because of the immense complexity of this process, as well 
as the importance of including a wide range of perspectives (including cross-organizational and 
multi-disciplinary viewpoints), we do not propose a benchmark in this paper. Instead, we work to 
clarify the ontology a benchmark should attempt to measure. We also discuss properties an AGI 
benchmark should possess. 


Our intent is that an AGI benchmark would include a broad suite of cognitive and metacognitive 
tasks (per Principle 3), measuring diverse properties including (but not limited to) linguistic 
intelligence, mathematical and logical reasoning (Webb et al., 2023), spatial reasoning, interpersonal 
and intra-personal social intelligences, the ability to learn new skills, and creativity. A benchmark 
might include tests covering psychometric categories proposed by theories of intelligence from 
psychology, neuroscience, cognitive science, and education; however, such “traditional” tests must first 
be evaluated for suitability for benchmarking computing systems, since many may lack ecological 
and construct validity in this context (Serapio-Garcia et al., 2023). 


One open question for benchmarking performance is whether to allow the use of tools, including 
potentially AI-powered tools, as an aid to human performance. This choice may ultimately be task 
dependent and should account for ecological validity in benchmark choice (per Principle 5). For 
example, in determining whether a self-driving car is sufficiently safe, benchmarking against a person 
driving without the benefit of any modern AI-assisted safety tools would not be the most informative 
comparison; since the relevant counterfactual involves some driver-assistance technology, we may 
prefer a comparison to that baseline. 


While an AGI benchmark might draw from some existing AI benchmarks (Lynch, 2023) (e.g., 
HELM (Liang et al., 2023), BIG-bench (Srivastava et al., 2023)), we also envision the inclusion of 
open-ended and/or interactive tasks that might require qualitative evaluation (Bubeck et al., 2023; 
Papakyriakopoulos et al., 2021; Yang et al., 2023). We suspect that these latter classes of complex, 
open-ended tasks, though difficult to benchmark, will have better ecological validity than traditional 
AI metrics, or than adapted traditional measures of human intelligence. 


It is impossible to enumerate the full set of tasks achievable by a sufficiently general intelligence. 
As such, an AGI benchmark should be a living benchmark. Such a benchmark should therefore include 
a framework for generating and agreeing upon new tasks. 


Determining that something is not an AGI at a given level simply requires identifying several⁵ tasks 
that people can typically do but the system cannot adequately perform. Systems that pass the majority 
of the envisioned AGI benchmark at a particular performance level ("Emerging," "Competent," etc.), 
including new tasks added by the testers, can be assumed to have the associated level of generality 
for practical purposes (i.e., though in theory there could still be a test the AGI would fail, at some 
point unprobed failures are so specialized or atypical as to be practically irrelevant). 
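

As a rough illustration of the decision rule described above, the following Python sketch rates a 
system at the highest performance level for which it passes at least a chosen fraction of benchmark 
tasks. The names `pass_rate`, `rate_generality`, and `pass_threshold` are hypothetical, and the 0.9 
default is only a placeholder: the fraction of tasks that should be required remains an open question. 
Stopping at the first level that falls short is likewise an assumption of this sketch rather than a 
rule stated in the framework. 

```python
from typing import Dict, List, Optional

# Performance levels named in the paper, ordered from lowest to highest.
PERFORMANCE_LEVELS = ["Emerging", "Competent", "Expert", "Virtuoso", "Superhuman"]


def pass_rate(results: List[bool]) -> float:
    """Fraction of tasks passed; an empty task list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def rate_generality(
    results_by_level: Dict[str, List[bool]],
    pass_threshold: float = 0.9,  # assumption: placeholder, not a value from the paper
) -> Optional[str]:
    """Return the highest level whose benchmark pass rate meets the threshold.

    Levels are checked in order and the scan stops at the first level that
    falls short (an assumption of this sketch), so no level can be skipped.
    Returns None if even the lowest level is not met.
    """
    achieved = None
    for level in PERFORMANCE_LEVELS:
        if pass_rate(results_by_level.get(level, [])) >= pass_threshold:
            achieved = level
        else:
            break
    return achieved


# Hypothetical results: one boolean per benchmark task at each level.
example = {
    "Emerging": [True] * 19 + [False],       # 19 of 20 tasks passed
    "Competent": [True] * 12 + [False] * 8,  # 12 of 20 tasks passed
}
print(rate_generality(example))
```

Under these made-up results the system would be rated "Emerging": it clears the placeholder threshold 
at that level but not at "Competent." New tasks added by testers would simply extend the per-level lists. 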


Developing an AGI benchmark will be a challenging and iterative process. It is nonetheless a 
valuable north-star goal for the AI research community. Measurement of complex concepts may be 
imperfect, but the act of measurement helps us crisply define our goals and provides an indicator of 
progress. 


Risk in Context: Autonomy and Human-AI Interaction 


Discussions of AGI often include discussion of risk, including "x-risk": existential (Center for AI Safety, 2023) 
or other very extreme risks (Shevlane et al., 2023). A leveled approach to defining AGI enables a 
more nuanced discussion of how different combinations of performance and generality relate to 
different types of AI risk. While there is value in considering extreme risk scenarios, understanding 
AGI via our proposed ontology rather than as a single endpoint (per Principle 6) can help ensure that 
policymakers also identify and prioritize risks in the near-term and on the path to AGI. 


Levels of AGI as a Framework for Risk Assessment 


As we advance along our capability levels toward ASI, new risks are introduced, including misuse 
risks, alignment risks, and structural risks (Zwetsloot and Dafoe, 2019). For example, the “Expert AGI” 
level is likely to involve structural risks related to economic disruption and job displacement, as more 
and more industries reach the substitution threshold for machine intelligence in lieu of human labor. 
On the other hand, reaching “Expert AGI” likely alleviates some risks introduced by “Emerging AGI” 
and “Competent AGI,” such as the risk of incorrect task execution. The “Virtuoso AGI” and “ASI” levels 
are where many concerns relating to x-risk are most likely to emerge (e.g., an AI that can outperform 
its human operators on a broad range of tasks might deceive them to achieve a mis-specified goal, as 
in misalignment thought experiments (Christian, 2020)). 


Systemic risks such as destabilization of international relations may be a concern if the rate of 
progression between levels outpaces regulation or diplomacy (e.g., the first nation to achieve ASI may 
have a substantial geopolitical/military advantage, creating complex structural risks). At levels below 
“Expert AGI” (e.g., “Emerging AGI,” “Competent AGI,” and all “Narrow” AI categories), risks likely 
stem more from human actions (e.g., risks of AI misuse, whether accidental, incidental, or malicious). 
A more complete analysis of risk profiles associated with each level is a critical step toward developing 
a taxonomy of AGI that can guide safety/ethics research and policymaking. 


5 We hesitate to specify the precise number or percentage of tasks that a system must pass at a given level of performance 
in order to be declared a General AI at that Level (e.g., a rule such as "a system must pass at least 90% of an AGI benchmark 
at a given performance level to get that rating"). While we think this will be a very high percentage, it will probably not 
be 100%, since it seems clear that broad but imperfect generality is impactful (individual humans also lack consistent 
performance across all possible tasks, but remain generally intelligent). Determining what portion of benchmarking tasks at 
a given level demonstrate generality remains an open research question. 


We acknowledge that whether an AGI benchmark should include tests for potentially dangerous 
capabilities (e.g., the ability to deceive, to persuade (Veerabadran et al., 2023), or to perform advanced 
biochemistry (Morris, 2023)) is controversial. We lean toward including such capabilities 
in benchmarking, since most such skills tend to be dual use (having valid applications to socially 
positive scenarios as well as nefarious ones). Dangerous capability benchmarking can be de-risked 
via Principle 4 (Potential, not Deployment) by ensuring benchmarks for any dangerous or dual-use 
tasks are appropriately sandboxed and not defined in terms of deployment. However, including such 
tests in a public benchmark may allow malicious actors to optimize for these abilities; understanding 
how to mitigate risks associated with benchmarking dual-use abilities remains an important area for 
research by AI safety, AI ethics, and AI governance experts. 


Concurrent with this work, Anthropic released Version 1.0 of its Responsible Scaling Policy (RSP) 
(Anthropic, 2023b). This policy uses a levels-based approach (inspired by biosafety level standards) 
to define the level of risk associated with an AI system, identifying what dangerous capabilities may 
be associated with each AI Safety Level (ASL), and what containment or deployment measures should 
be taken at each level. Current SOTA generative AIs are classified as an ASL-2 risk. Including items 
matched to ASL capabilities in any AGI benchmark would connect points in our AGI taxonomy to 
specific risks and mitigations. 


Capabilities vs. Autonomy 


While capabilities provide prerequisites for AI risks, AI systems (including AGI systems) do not and 
will not operate in a vacuum. Rather, AI systems are deployed with particular interfaces and used to 
achieve particular tasks in specific scenarios. These contextual attributes (interface, task, scenario, 
end-user) have substantial bearing on risk profiles. AGI capabilities alone do not determine destiny 
with regard to risk, but must be considered in combination with contextual details. 


Consider, for instance, the affordances of user interfaces for AGI systems. Increasing capabilities 
unlock new interaction paradigms, but do not determine them. Rather, system designers and end- 
users will settle on a mode of human-AI interaction (Morris et al., 2023) that balances a variety of 
considerations, including safety. We propose characterizing human-AI interaction paradigms with six 
Levels of Autonomy, described in Table 2. 


These Levels of Autonomy are correlated with the Levels of AGI. Higher levels of autonomy are 
“unlocked” by AGI capability progression, though lower levels of autonomy may be desirable for 
particular tasks and contexts (including for safety reasons) even as we reach higher levels of AGI. 
Carefully considered choices around human-AI interaction are vital to safe and responsible deployment 
of frontier AI models. 


We emphasize the importance of the “No AI” paradigm. There may be many situations where this 
is desirable, including for education, enjoyment, assessment, or safety reasons. For example, in the 
domain of self-driving vehicles, when Level 5 Self-Driving technology is widely available, there may 
still be reasons for using a Level 0 (No Automation) vehicle. These include instructing a new driver 
(education), driving for pleasure by enthusiasts (enjoyment), driver’s licensing exams (assessment), 
and conditions where sensors cannot be relied upon, such as technology failures or extreme weather 
events (safety). While Level 5 Self-Driving (SAE International, 2021) vehicles would likely be a Level 
5 Narrow AI (“Superhuman Narrow AI”) under our taxonomy⁶, the same considerations regarding 
human vs. computer autonomy apply to AGIs. We may develop an AGI, but choose not to deploy it 
autonomously (or choose to deploy it with differentiated autonomy levels in distinct circumstances as 
dictated by contextual considerations). 


| Autonomy Level | Example Systems | Unlocking AGI Level(s) | Example Risks Introduced |
|---|---|---|---|
| Autonomy Level 0: No AI (human does everything) | Analogue approaches (e.g., sketching with pencil on paper); non-AI digital workflows (e.g., typing in a text editor; drawing in a paint program) | No AI | n/a (status quo risks) |
| Autonomy Level 1: AI as a Tool (human fully controls task and uses AI to automate mundane sub-tasks) | Information-seeking with the aid of a search engine; revising writing with the aid of a grammar-checking program; reading a sign with a machine translation app | Possible: Emerging Narrow AI. Likely: Competent Narrow AI | de-skilling (e.g., over-reliance); disruption of established industries |
| Autonomy Level 2: AI as a Consultant (AI takes on a substantive role, but only when invoked by a human) | Relying on a language model to summarize a set of documents; accelerating computer programming with a code-generating model; consuming most entertainment via a sophisticated recommender system | Possible: Competent Narrow AI. Likely: Expert Narrow AI; Emerging AGI | over-trust; radicalization; targeted manipulation |
| Autonomy Level 3: AI as a Collaborator (co-equal human-AI collaboration; interactive coordination of goals & tasks) | Training as a chess player through interactions with and analysis of a chess-playing AI; entertainment via social interactions with AI-generated personalities | Possible: Emerging AGI. Likely: Expert Narrow AI; Competent AGI | anthropomorphization (e.g., parasocial relationships); rapid societal change |
| Autonomy Level 4: AI as an Expert (AI drives interaction; human provides guidance & feedback or performs subtasks) | Using an AI system to advance scientific discovery (e.g., protein-folding) | Possible: Virtuoso Narrow AI. Likely: Expert AGI | societal-scale ennui; mass labor displacement; decline of human exceptionalism |
| Autonomy Level 5: AI as an Agent (fully autonomous AI) | Autonomous AI-powered personal assistants (not yet unlocked) | Likely: Virtuoso AGI; ASI | misalignment; concentration of power |

Table 2 | More capable AI systems unlock new human-AI interaction paradigms (including fully 
autonomous AI). The choice of appropriate autonomy level need not be the maximum achievable 
given the capabilities of the underlying model. One consideration in the choice of autonomy level 
is the resulting risk. This table’s examples illustrate the importance of carefully considering human-AI 
interaction design decisions. 


Certain aspects of generality may be required to make particular interaction paradigms desirable. 
For example, Autonomy Levels 3, 4, and 5 ("Collaborator," "Expert," and "Agent") may only work 
well if an AI system also demonstrates strong performance on certain metacognitive abilities (learning 
when to ask a human for help, theory of mind modeling, social-emotional skills). Implicit in our 
definition of Autonomy Level 5 ("AI as an Agent") is that such a fully autonomous AI can act in an 
aligned fashion without continuous human oversight, but knows when to consult humans (Shah et al., 
2021). Interfaces that support human-AI alignment through better task specification, the bridging of 
process gulfs, and evaluation of outputs (Terry et al., 2023) are a vital area of research for ensuring 
that the field of human-computer interaction keeps pace with the challenges and opportunities of 
interacting with AGI systems. 


Human-AI Interaction Paradigm as a Framework for Risk Assessment 


Table 2 illustrates the interplay between AGI Level, Autonomy Level, and risk. Advances in model 
performance and generality unlock additional interaction paradigm choices (including potentially fully 
autonomous AI). These interaction paradigms in turn introduce new classes of risk. The interplay of 
model capabilities and interaction design will enable more nuanced risk assessments and responsible 
deployment decisions than considering model capabilities alone. 


Table 2 also provides concrete examples of each of our six proposed Levels of Autonomy. For each 
level of autonomy, we indicate the corresponding levels of performance and generality that "unlock" 
that interaction paradigm (i.e., levels of AGI at which it is possible or likely for that paradigm to be 
successfully deployed and adopted). 


Our predictions regarding "unlocking" levels tend to require higher levels of performance for 
Narrow than for General AI systems; for instance, we posit that the use of AI as a Consultant is 
likely with either an Expert Narrow AI or an Emerging AGI. This discrepancy reflects the fact that for 
General systems, capability development is likely to be uneven; for example, a Level 1 General AI 
("Emerging AGI") is likely to have Level 2 or perhaps even Level 3 performance across some subset of 
tasks. Such unevenness of capability for General AIs may unlock higher autonomy levels for particular 
tasks that are aligned with their specific strengths. 
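

To make the "unlocking" relationship concrete, the Python sketch below transcribes the "Likely" 
entries of Table 2 into a lookup and reports the highest autonomy level a given system is likely to 
unlock. The names `LIKELY_UNLOCKS` and `highest_likely_autonomy` are hypothetical. The sketch also 
makes two simplifying assumptions that go beyond the table: thresholds are compared only within the 
same generality category, and unlocking a given autonomy level also makes every lower level available. 

```python
from typing import Dict, List, Tuple

# (generality, performance level); performance runs from 1 (Emerging) to 5 (Superhuman).
SystemLevel = Tuple[str, int]

# "Likely" unlocking levels transcribed from Table 2, keyed by autonomy level.
LIKELY_UNLOCKS: Dict[int, List[SystemLevel]] = {
    1: [("Narrow", 2)],                   # Competent Narrow AI
    2: [("Narrow", 3), ("General", 1)],   # Expert Narrow AI; Emerging AGI
    3: [("Narrow", 3), ("General", 2)],   # Expert Narrow AI; Competent AGI
    4: [("General", 3)],                  # Expert AGI
    5: [("General", 4), ("General", 5)],  # Virtuoso AGI; ASI
}


def highest_likely_autonomy(generality: str, performance: int) -> int:
    """Highest autonomy level that Table 2 marks as likely for this system.

    Assumptions of this sketch: thresholds are compared only within the same
    generality category, and meeting a threshold also makes every lower
    autonomy level available.
    """
    highest = 0  # Autonomy Level 0 ("No AI") is always an option.
    for autonomy, unlocks in LIKELY_UNLOCKS.items():
        for unlock_generality, unlock_performance in unlocks:
            if generality == unlock_generality and performance >= unlock_performance:
                highest = max(highest, autonomy)
    return highest


# AI as a Consultant (Level 2) needs Expert performance from a Narrow system
# but only Emerging performance from a General one.
print(highest_likely_autonomy("Narrow", 2))   # Competent Narrow AI
print(highest_likely_autonomy("General", 1))  # Emerging AGI
```

The returned value is an upper bound rather than a recommendation: a lower autonomy level may still 
be preferable for a particular task or context, including for safety reasons. 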


Considering AGI systems in the context of use by people allows us to reflect on the interplay 
between advances in models and advances in human-AI interaction paradigms. The role of model 
building research can be seen as helping systems’ capabilities progress along the path to AGI in their 
performance and generality, such that an AI system’s abilities will overlap an increasingly large portion 
of human abilities. Conversely, the role of human-AI interaction research can be viewed as ensuring 
new AI systems are usable by and useful to people such that AI systems successfully extend people’s 
capabilities (i.e., "intelligence augmentation" (Brynjolfsson, 2022)). 


6 Fully autonomous vehicles might arguably be classified as Level 4 Narrow AI ("Virtuoso Narrow AI") per our 
taxonomy; however, we suspect that in practice autonomous vehicles may need to reach the Superhuman performance 
standard to achieve widespread social acceptance regarding perceptions of safety, illustrating the importance of contextual 
considerations. 


Conclusion 


Artificial General Intelligence (AGI) is a concept of both aspirational and practical consequences. In this 
paper, we analyzed nine prominent definitions of AGI, identifying strengths and weaknesses. Based 
on this analysis, we introduced six principles we believe are necessary for a clear, operationalizable 
definition of AGI: focusing on capabilities, not processes; focusing on generality and performance; 
focusing on cognitive and metacognitive (rather than physical) tasks; focusing on potential rather 
than deployment; focusing on ecological validity for benchmarking tasks; and focusing on the path 
toward AGI rather than a single endpoint. 


With these principles in mind, we introduced our Levels of AGI ontology, which offers a more 
nuanced way to define our progress toward AGI by considering generality (either Narrow or General) 
in tandem with five levels of performance (Emerging, Competent, Expert, Virtuoso, and Superhuman). 
We reflected on how current AI systems and AGI definitions fit into this framing. Further, we discussed 
the implications of our principles for developing a living, ecologically valid AGI benchmark, and argued 
that such an endeavor (while sure to be challenging) is a vital one for our community to engage with. 


Finally, we considered how our principles and ontology can reshape discussions around the risks 
associated with AGI. Notably, we observed that AGI is not necessarily synonymous with autonomy. 
We introduced Levels of Autonomy that are unlocked by, but not determined by, progression through 
the Levels of AGI. We illustrated how considering AGI Level jointly with Autonomy Level can provide 
more nuanced insights into likely risks associated with AI systems, underscoring the importance of 
investing in human-AI interaction research in tandem with model improvements. 